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Abstract. The particle Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm 
which operates on the extended space of the auxiliary variables generated by an interacting 
particle system. In particular, it samples the discrete variables that determine the particle 
genealogy. We propose a coupling construction between two particle Gibbs updates from 
different starting points, which is such that the coupling probability may be made arbitrary 
large by taking the particle system large enough. A direct consequence of this result is 
the uniform ergodicity of the Particle Gibbs Markov kernel. We discuss several algorithmic 
variations of Particle Gibbs, either proposed in the literature or original. For some of 
these variants we are able to prove that they dominate the original algorithm in asymptotic 
efficiency as measured by the variance of the central limit theorem's limiting distribution. 
A detailed numerical study is provided to demonstrate the efficacy of Particle Gibbs and 
the proposed variants. 
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1. Introduction 

PMCMC (Particle Markov chain Monte Carlo) is a new set of MCMC algorithms dis- 
covered by Andrieu et al. (2010) which has attracted considerable attention in Statistics. It 



has in a relatively short span of time generated a series of diverse research contributions, 



whether methodological (Silva et al. , 2009 Whiteley et al. 2010; Chopin et al. 2012 Lind- 



sten et al. , 2012) or applied, the latter in domains as diverse as Ecology (Peters et al. , 2010) 



Electricity Forecasting (Launay et al. , 2012), Finance (Pitt et al. 



(Golightly and Wilkinson 2011), study of social networks (Everitt 



2012), systems Biology 



2012), Hydrology (Vrugt 



et al. , 2012), among other fields. One appeal of PMCMC is that it makes it possible to 
perform "plug-and-play" inference for complex hidden Markov models, that is, the only re- 
quirement is that one needs to be able to sample from the Markov transition of the hidden 
chain, which is most cases a non demanding requirement, in contrast to previous approaches 
based on standard MCMC. 

The specificity of PMCMC is that each MCMC step includes the generation of a complete 
interacting particle system, that is, the set of random variables generated when running a 



particle filtering (also known as Sequential Monte Carlo (SMC)) algorithm; see Doucet et al. 
(2001) Del Moral (2004) and Cappe et al. (2005) for general references on particle filtering. 



Several instances of PMCMC may be understood as exact Monte carlo approximations of an 
ideal algorithm, that is to say they may be analysed as noisy versions of an ideal algorithm 
where some intractable quantity in the ideal algorithm is replaced by an unbiased Monte 
carlo estimate, which is computed using the interacting particle system. This is the approach 



followed in Andrieu and Vihola (2012 ). The term exact highlights the fact that, despite being 



an approximation of an ideal algorithm, PMCMC samples exactly from the distribution of 
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interest. However, this interpretation does not appear to be suitable for those variants of 
PMCMC that involves a Particle Gibbs (also known as conditional Sequential Monte Carlo 
(CSMC)) step. While the Particle Gibbs step also generates a complete interacting particle 
system at each iteration, it does so conditionally on the trajectory for one particle being 
fixed, and it does not replace an intractable quantity of an ideal algorithm with an unbiased 
estimator. 

One of the objectives of this paper is to undertake a theoretical study of Particle Gibbs. 
We design a coupling construction between two Particle Gibbs updates that start from two 
different trajectories and using this coupling construction, we are able to establish that 
the coupling probability may be made arbitrary large if the number N of particles is large 
enough. A direct consequence is the uniform geometric ergodicity of the Particle Gibbs 
Markov transition kernel. 

We also derive several variants of the Particle Gibbs algorithm by considering alternat- 
ive resampling schemes, which is an unexplored and important facet of PMCMC. We also 



discuss backward sampling, an extra step for Particle Gibbs proposed by Whiteley (2010). 
We show that all these variants of Particle Gibbs (including the original algorithm) define 
reversible kernels. Regarding backward sampling, we establish that the Particle Gibbs kernel 
which incorporates backward sampling dominates Particle Gibbs without it, in asymptotic 
efficiency as measured by the variance of the CLT. Finally, we present a detailed numerical 
comparison of all these different algorithms. 



and |3] set up respectively the notations 
gives a probabilistic description of the 



The plan of the paper is the following. Sections |2 
and the considered Feynman-Kac model. Section 4 
corresponding particle system, and Section [5] derives the conditional distribution associated 
to the Particle Gibbs step. Section [6] describes the conditional resampling algorithms cor- 
responding to either residual and systematic resampling. Section [7] develops the coupling 
construction mentioned above. Section |8] introduces a variant of Particle Gibbs which we 
call "forced move". Section [9] discusses the BS (backward sampling) step, establishes that 
Particle Gibbs, either with or without an extra BS step, defines a reversible kernel, and 
uses this result to establish that Particle Gibbs with BS dominates Particle Gibbs without. 



Section 10 is a numerical study of the several variants of Particle Gibbs algorithm. Section 



11 concludes. The Appendix contains the proofs of technical results not included in the main 



body of the paper. 



2. Notations 



For m < n, we denote by m : n the range of integers {m, . . . , n}, and we use extensively 
the semi-colon short-hand for collections of random variables, e.g. Xq-t = {Xq, . . . X^), 
Xl''^ = {Xl, . . . , X^), and even in an nested form, X^:^ = (Xq-^, . . . , X^'^); more generally 
X]^, where f is a vector in N"^ will refer to the collection (X")„gt,. These short-hands are 
also used for realisations of these random variables, which are in lower case, e.g. xo:i or xj'^ . 
The sub- vector containing the t first components of some vector Zt is denoted [^tIj- 

For a vector r^'-^ of probabilities, r" e [0, 1] and Yln=i^"^ ~ -'-' ^^ denote by A^(r^'^) the 
multinomial distribution which produces outcome n with probability r", n G 1 : X. For 
reals x,y let x\/ y = max(a;, y) and x Ay = min(x, y). The integer part of x is [x\ , and the 
positive part is x^ = x V 0. The cardinal of a finite set C is denoted as \C\. 
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For a complete separable metric space X, we denote by V{X) the set of probability 
distribution on X. For a probability measure // G V{X), a kernel K : X —)■ V{X) and 
a measurable function / defined on X, we use the following standard notations: fi{f) = 
J^dfif, Kf is the application x — )■ f^ K{x,dx')f{x'), and fiK is the probability measure 
{lJ,K){A) = J^ fi{dx)K{x,A). The atomic measure at a G A' is denoted 6a{dx). We denote 
hj fi ^ K the measure n{dx)K{x, dx') on the product space X x X. 

3. Model 

Let {Xt)t>o be a discrete-time A'-valued Markov chain, with initial law mo{dxo) = mo{xQ) dxo, 
and transition law mt{xt-i,dxt) = mt{xt-i,Xt) dxt- Let {Gt)t>o be a sequence of A' — )■ M"*" 
potential functions. In the context of hidden Markov models, Gt{xt) = g{xt,yt), the density 
(with respect to some dominating measure dy) of observation yt of the 3^-valued random 
variable Yf, conditional on state X^ = Xt- 

It is convenient in this work to rewrite this Feynman-Kac model as a Feynman-Kac path 
model, that is a model operating on trajectories. Thus, we define Zt = xq.i, Zt = Xo.t, taking 
values in X^^^, and slightly abusing notation, we extend the domain of Gt from X to X^^^ 
as follows: Gt{zt) = Gt{xt)- The Z^'s form a time inhomogeneous Markov kernel, with initial 
law qoi^dzo) = mo^dxo), and transition 

qt{zt-i,dz[) = 5^^_,{dxQ.t_^)mt{xt-i,x[)dx[ 

i.e. keep all of Zt_i and append new state Xt, from Markov transition mt{xt„i,dxt). The 
associated Feynman-Kac path measures are then traditionally defined as 



1 * 

(3.1) Qt{dzt) = Qt{dxo;t) = —mo{dxo) TT {Gs-i{xs-i)ms{xs-i, dxs)} 

^+ 

where 2^( > is defined as 

Zt= I 'mQ{dxQ)W{Gs-i{xs-i)'ms{Xs-i,dXs)} . 
Jx^+^ f=i 

4. Probabilistic description of the interacting particle system 

The aim of this section is to derive the joint distribution of all the random variables 
generated in the course of the execution of an interacting particle algorithm that targets the 
Feynman-Kac path measures given in the previous section. 

The particle representation Q^ [dzt) is the empirical measure defined as, for t > 0, 

1 ^ 
Q'^{dz,) = -Y,5z^{dz,) 

11=1 

where the particles Z]'^ = {Z^,...,Z^) are defined recursively as follows. First, Zq'^ is 
obtained by sampling N times independently from mo{xQ)dxo. To progress from time t to 
time t + 1, t > 0, the pair {Aj'-^ , Z^j^) is generated jointly from 

N 

gt{zl--'',dal--'')l[qt+,{zf,dz^^,), 

ra=l 
3 



n-.N _ ^l:N 
't ~ ^t 
A:N J M:N\ 



conditionally on Z}'^ = z}'^ , where the A"'s, n & 1 : N, jointly sampled from the resampling 



distribution Qt{zj ,dAj ) are the ancestor variables, i.e. A" is the label of the particle at 
time t which generated particle Z^"^^ at time t + 1. 

A common assumption on the resampling distribution gt is the following. 

Assumption (MR). The resampling distribution gt{zl'^ ,daj'^), t>0, has probability dens- 
ity (abusing notations and using the same symbol gt for the probability density): 

N 
n=\ 

where the W^ 's are the normalised weights 

(4.1) wnzl--"") = ^^^ , nel:N. 

In other words, the A^ 's are independent draws from M. {W}'^ {zl'^)^ . 

Analysis under Multinomial resampling, which is an important sampling strategy in the 
literature on particle filtering, is much simpler than under the more general strategies below. 
That is, the validity of the particle Gibbs sampler may be established for other resampling 
schemes, provided they follow the more general assumption (R). 

An A^— cycle c : (1 : A^) — ;■ (1 : A^) is a permutation such that c{n) = [n + k) (mod A^), 
for all n G 1 : A^, and some k & 1 : N. Note that if c is a cycle then so is c~^. 

Assumption (R). The discrete distribution gt{zl''^ , daj'^) = gtiz}''^ , a\''^)da\''^ , which takes 
values in (1 : N)®^ , is (a) marginally unbiased, that is, the marginal probability that A^ = m 
is Wl"{zl'-'^), for m,n E I : N; and (b) is cycle-invariant, that for any two N— cycles c, d 
we have gt{zl'^,a^ ' ) = gt{zl''^ ,a\''^) and gt{zl''^ ,a\'-^) = gt{zl ' \c~'^{a\'^)). 

Assumption (Rb) is a technical condition that facilitates the formalisation of Particle 
Gibbs. The first part of (Rb) states that a sample from gt^z}'^ , ■) can be arbitarily cycled 
without changing its law. The second part of (Rb) states that if A\''^ is a sample from 
gt{zl'-^,-) then c~^(Al), . . . ,c~^{A^) is a sample from gt{z^ ' ',■). We will discuss both 
these points in more in detail in Section [6J For now, we note that, under Assumption (R), 
we can write, for any n E 1 : N, 

(4.2) g,izr-^, an = wfizngtiz^, a^^^l^r = <), 

where g1{zl'^ ^ai \A^ = a") is the probability density of A^' , the collection of A^ 
variables for m ^ n, conditional on A" = a". 

The law of the collection of random variables {Zq':^ , Aq:^_^) generated from time to some 
final time T > 1 is therefore 



N 



t=l L n=l 

Proposition 1. Under Assumption (Ra), one has 



(4.3) E, 



tf^ 



T r , N 



t=0 L n=l 



Z. 



This result has been estabhshed previously in e.g. Del Moral (2004) under Assumption 



(MR) but the author only really used the marginal unbiased property as formalised in As- 
sumption (Ra). 



5. The Conditional Particle Filter 

This section is concerned with the definition of the Markov kernel of the Particle Gibbs 
algorithm that has Qt{dzt) as its invariant measure; i.e. the Markov kernel operates on the 
path space in the sense that it maps X"^^^ — )■ V{X'^^^). We call this Markov kernel the CPF 
(Conditional Particle Filter) kernel for reasons to be made obvious below. 

In order to define the CPF kernel and prove it leaves Qt{dzt) invariant, we commence first 
with the definition of the following extended distribution 7r|f whose sampling space is the 
sampling space of i)^ augmented to include a discrete random variable A^* G 1 : A^, 






(5.1) 




N 



n=\ 



The fact that the expression above does define a correct probability law (with a density that 



integrates to one) is an immediate consequence of the unbiasedness property given in (4.3); 
note that cycle invariance is not needed to show vr^ integrates to one. 



Proposition 2. Under Assumption (Ra), the distribution n^ is such that the marginal 
distribution of the random variable Z^ 



Zjrp 



IS 



ET- 



This Proposition has been established by Andrieu et al. (2010) under Assumption (MR), 



but extension to Assumption (Ra) is trivial. In particular, the expectation of functions of Z^ 
may be computed by integrating out the variables in the order n*, x^ 



1:N ^1:N 



.X 



1:N ^1:N 







,x 



1:N 



Z^ as follows. 



Note that cycle invariance is not needed in this proposition. 

Given a sample from n^ , we can trace the ancestry of the variable Z^ 
Let B* for t G : T be the index of the time t ancestor particle of trajectory Z^, which is 

defined recursively backwards as B^ = N*, then B^ = A^ '+\ for t = T — 1, . . . , 0. Finally, 
let Z* 



A:N\* 



for t G : T, so that Z^ is precisely the first t + 1 components of Zj 



Let Z^ be the ordered collection of the A^ — 1 trajectories Z" such that n ^ B* 

(i.e. n 7^ A^* when t = T), and Zq.^ 



1:N\*: 



[.^0 , . . . , ^2" 



Zrp ^*). Define similarly Aq't-i 



|l:Af\* 



,l:7VVx 



l:iV\* 



,Bt 



{Af^'"'^*, . . . ,A^Zi*), where A^'"^* is A]-^ excluding Ap"''\ It is convenient to apply the 
following one-to-one transformation to the argument of ir^: 



(4 



t^n.T^ ^ ^ if^ J v^ V 0"T ' 0"T— 1' 0"T' 0"T— 1' 



T ' OiT— 1' 



With a slight abuse of notation, we identify the law induced by this transformation (going 
to the representation with the b* variables) as tc^ as well: 
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(5.2) Ti'^{dz, 






(I/0/r\,q-\ 1 ^ (ljZr\.rri ^ (ljOr\.rp -i ^ Qjlh I 



T 

n 



T-l 
t=0 



{dz, 



X JJ mo(rfzo 



^l 



a:Af 



iV^t- 



1 1 "'^f 



1 l^t-1 



^t-1 



t=l 



n 



9tU; 



t-l 5 



rf^D 



Passage from (5.1) to (5.2) is straightforward. It is worth noting that the marginal law of 
{Bq.j<_]^,N*) is the uniform law on the product space (1 : A^)-^+^. 



Given a sample Zt = zt from Q^-, consider the following three step sampling procedure 
that transports Zt = zt to define a new random variable Z'rp G X'^'^^. Step 1 is to sample 
the ancestors {BQ.rp_^,N* 



of the random variable Z^.j. 



is to generate the N — 1 remaining trajectories 

A^*) from 



\^0:Tj Bq;T-1i 



yl:N\-k ^l:N\-k 



Zo;T from TC^{dbQ.j<_i,dn*); Step 2 
^o't-i) conditional on the trajectory 



TT. 



N 



(dzo, 



1:N\*: 
T ' 



w"0:T-lK0:T5 "0:T-l5 "' 



l:Nf^l:N 



T 



)), which is also a 



Note that Steps 1 and 2 are both Gibbs step with respect to (5.2). Step 3 is to resample the 
index iV*; hence A^* is sampled from TT^{dn*\zQ:^ , Oq-t-i) ~ M.{Wr^ 
Gibbs step, but this time with respect to (5.1 ); recall that W^{z]i^ 

Z^ is also 



.m-^TyZj, ^ 



It follows from Proposition l2j that the law of Z'j, = Z^* is also Q^. As we detail below, there 
is no specific difficulty in performing Step 2, which is pretty much equivalent to the problem 
of generating a particle filter for T time steps while Step 3 is evidently trivial. We show Step 
1 is actually redundant by proving below that Steps 2 and 3 (applied in succession), for a 
fixed {B^.j._^^ N*), say (1, . . . , 1), induces a Markov kernel, which we call the CPF kernel, 
that maps A"^"*"^ — )■ V{X^^^) with Q^" as the invariant measure; Assumption (Rb) is crucial 
here. We now define the CPF Markov kernel. 

Given a realization z^ for Z^ ~ Qt, we initialize the CPF kernel by arbitrarily choosing a 
realization (6o:r-i5 ''^*) for {BQ.rp_^, N*), which we always set to (1, . . . , 1). Moving z^ through 
CPF kernel may be decomposed into two steps (which were Steps 2 and 3 above): 

CPF-1: The first step where the A^— 1 remaining trajectories are generated, conditional 
on the trajectory Z^, kept as "frozen", and assigned the ancestry {Bq.j,_i,N*) = 
(1, . . . , 1). For t G : T — 1 let z* = [z-f-jt+i. By direct inspection of (5.2), one sees 
that the first step of applying the CPF kernel is equivalent to sampling from the 
conditional distribution 



(5.3) TCj^ldzQ'.rp ,daQ:rp_^\ZQ.j. — ZQ.rp,AQ.rp_^ — [1, . . . ,1) , N —1) 



rrin 



t=i 



N 



l[qt{z:i-,\dz-) 



n=2 



CPF-2: A second step where the index A^* is randomly regenerated to change the 
selected trajectory, which is the realization z^ we started of with, to Z^* . The 
second step is a Gibbs step which updates component A^* from the distribution vr^ 
conditional on all the other variables; hence A^* is sampled from A^(W^'^(zy^)). 
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Sampling from the distribution of step CPF-1 is equivalent to running a particle algorithm 
with A^ particles, while maintaining one trajectory (with all labels equal to 1 in the expression 
above) as "frozen" . One difficulty is to sample from the conditional resampling distribution 
Qt-i- We return to this point in Section [61 

With all these considerations, one sees that the CPF algorithm defines the following kernel 
P^ : X^+^ -^ V{X^+^): for z^ G A'^+\ 



{P^^){z*t) = I PTi4.dz'TMz'T) 



N 



(5.4) = E^. ^#1 ^(4) + y ^^ ^{Z-) 

The following result shows that the ancestry (BQ.rp_^, N*) of the starting trajectory in the 
CPF kernel can indeed be assigned in an arbitrary fashion. 

Proposition 3. Let P^ denote the CPF Markov kernel. Under Assumption (R), the image 
of z^ G X'^'^^ under P^ is unchanged by the choice of (&o:T-1''^*) /^'^ ^^^ realization of 
(-Bg-T-i)^*) ^^ the initialization of the CPF kernel. 

Assumption (Rb), that is cycle invariance, makes it possible to skip the step that would 
sample (i?Q.y_^, A^*). Thus it follows from Proposition [SJ and the discussion preceding the 
definition of steps CPF-1 and CPF-2, that P-^ has invariant probability measure Qt((^^t); 
for any N > 2. To the best of our knowledge, this proposition is original. 

Proof. [Proposition Is] The result is proven by showing that the following two choices for 
(-Bq-t-I) ^*) result in the image of z^ under P^ being unchanged. Choice one is (6o-r-i5 ""-*) — 
(!,...,!) while choice two is (6o:t-i; ''^*) — (^O; • • • , "^r) € (1 : iV)-^"*"^. For the sake of clarity, 
all random variables for choice 2 are given the inverted hat superscript. Let Zq':^ , ^q-t-i 
denote the random variables sampled in step CPF-1 for choice one. Let Zq'j, , ^q-t-i denote 
the random variables to be sampled in step CPF-1 for choice two, that is, is a sample from 
TT^ {dzQ^rp^* , da()\^_\\z^.rp,bQ.rp_i^ dn*). We will show a specific deterministic transformation 
will convert samples from choice one to samples from choice 2 from which we can then 
straightforwardly obtain the stated result. 

Recall, by definition, for t G : T we have Z^ = [z^]t+i ■ We define Cq, . . . ,ct be the 
sequence of cycles satisfying Ct{it) = 1 for all t G : T. It follows that Zq^ = Zq 
is a sample from < {dz^'-^ \{Zi° , ...,Z'^) = z^t, {B*.,t-i. N*) = io-.t)- By (Rb), A^-^ = 
c-^\Af^^), . . . , c^\Af''^) is a sample from tt^ (rfa^^^lZi^^, {Z^\. . . , Z'^) = z*,.,^, {B^o-.t-i, N*) = ^o-.t) • 
It now follows that Zi^''^ = Zf ^^^^^ is a sample from vrf {dzl'-^lZ^'-^ , A},-^ , (^o°, • • • > Z'jf) = z*.j., {B*.^_^,N* 
Continuing in a similar fashion one shows that Zt = Zjf ' is a sample from ir^ {dz]^^ \Zq:^_i, Aq.^_^, [Z 
Zq.j,,{Bq.j^_-j^,N*) = iq-.t)- The result now follows since after step CPF-2 the probabilities 
that either Z^* or Z^*belong to some measurable set V is the same by Assumption (Ra). D 

6. Conditional resampling algorithms 

We have seen in the previous Section that the CPF step involves sampling from the 
conditional resampling distribution g^. Under Assumption (MR) (multinomial resampling), 
sampling g'j. is trivial: draw A^ — 1 times from the multinomial distribution that assigns 
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probability Wp{zl ) to outcome n, n ^ 1 : N. We describe in this section two alternative 



conditional resampling algorithms, one corresponding to the residual resampling (Liu and 



Chen, 1998), and the other corresponding to systematic resampling (Carpenter et al. , 1999). 

As already explained, it is sufficient to describe how to sample from q'j:{zI''^ , daf^ \Al = 1) 

for some fixed t and we shall use the short-hand W" for the resampling weights W^{zl'^), 



N 



defined in (4.1); note that Y.n=i ^" = 1- 



6.1. Condit ional residual resampling. The residual resampling algorithm of Liu and 
Chen (1998) consists in creating A^ off-springs, based on the weights W^'^ , as follows. Let 



the residue r*^ = NW - [NW''\ ioi n e 1 : N, and R = ^^^^ r". First, stack [NW] 
"deterministic" copies of particle label n, sequentially in n. This gives a vector oi N — R 
labels. Then, extend that vector by appending R draws from A^(r^'^/i?). 

To make this algorithm fulfil Assumption (R), it is sufficient to permute randomly the so 
obtained vector of labels. In this way, the resulting algorithm is both marginally unbiased 
and cycle- invariant. 

We now describe an algorithm to draw from the conditional distribution qI{zI'^ ^ da'f'^\A\ = 
1), using the same notations, i.e. r" = NW"' — [NW^l , R = X]n=i ^"- ^^ observe first that, 
under residual resampling, the probability that Aj is set to one of the [A^VFj^J "determin- 
istic" copies of label 1 is [A^PF^J /NW'^. This observation leads to the following algorithm, 
which generates a vector A]'^ of A^ labels, with label one in slot one. A] = 1. 

(1) With probability [A^iy^J /NW^, perform standard residual resampling (without ran- 
dom permutation). Note that in that case, the first label is necessarily equal to 1 
(since NWi > 1). 

(2) Otherwise (with probability r^/NW^), perform the following modified residual res- 
ampling scheme: put label 1 in slot one, then stack recursively [A^VT/^J deterministic 
copies of particle label n, sequentially in n G 1, . . . , A^. Fill the {R — 1) remaining 
slots by independent draws from M.{r^'-^ / R) . 

(3) Randomly permute the A^ — 1 labels in positions 2 : A^. 

Step 1 to Step 3 formally describes how to sample from q1{zI''^ ,da'^'^\A\ = 1), when 
Qt{zl'-^ ^da]''^) is the joint distribution of labels obtained by residual resampling, then ran- 
dom permutation. However, Step 3 does not need to be implemented in practice, because 
(provided residual resampling is used at every iteration) the actual order of particles with 
labels 2 : A^ does not play any role in the operations performed at the following iterations. 



6.2. Condit i onal systematic resampling. The systematic resampling algorithm of Car- 
penter et al. (1999) consists in creating A^ off-springs, based on the weights VT^'^, as follows. 



Let U a uniform variate in [0, 1], and consider a segment of length A^, made of A^ successive 
segments of length NW^, n = 1, . . . ,N. Take s = U, and repeat the following until s > N: 
determine which segment n contain s, add one off-spring to particle n, and increment s, 
s -^ s + 1. 

A particular difficulty of this resampling scheme is that it is order-dependent. Even if we 
would randomly permute the labels at the end of the algorithm (like in residual sampling), we 
may still have gt^z}'^ , da\''^) ^ Qt{zl ' , da\''^) for a certain permutation a : (1 : A^) — )■ (1 : 
A^). Instead, we propose to include a a cycle randomisation step in systematic resampling; 
that is, we take A\'^ = A^ ' , where c is a A^— cycle (chosen uniformly among the A^ possible 
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cycles) and Aj'^ is the vector of labels obtained from systematic resampling. It is trivial 
to see that that the corresponding algorithm fulfils Assumption (Ra), i.e. it is marginally 
unbiased, and that (Rb), cycle-invariance, is also satisfied. (Rb) could be established as 
follows: it is a property of standard systematic sampling without cycling that if Aj'^ is a 
sample from gt{zl'^,-) then c~^{Af '),... ,c~^{Af ') is a sample from gt{z^ ' ,■)• The 
result now follows from the cycle randomization step. 

We now describe how one may generate A^ — 1 labels from the conditional distribution 
g1{zl'^ , daf'^ \Al = 1). One distinctive property of systematic resampling is that other 
than cycling the only randomness in the algorithm corresponds to the uniform variate U 
in [0,1] drawn to commence the sampling. Sampling from g^lzl'^ ,daf^\A^ = 1) amounts 
to sampling from the posterior distribution of U conditioned on the fact that after cycling 
Aj = 1 (step 1 below), which is then followed by a step that realizes the iVoff-springs from 
the sampled U as in standard systematic resampling, and finally a step (step 3) that chooses 
a cycle uniformly from the set of cycles that ensures Aj = 1. 

The number of off-springs of particle n is either [A^W^^J or [A^W^"J + 1. Again, we call 
"deterministic copy" the [A^W^"] copies obtained by the [A^W^^J first jumps of cursor s in 
interval n, and we call "random copy" the extra copy which is generated with probability 
r" = NW"' — [NW"'\. Conditional on A] = 1, the probability that a deterministic copy of 
1 was put in slot 1 is [A^M^/J /NW^. This observation leads to the following algorithm for 
generating from g1{zl''^ ,dAf^\Aj = 1): 

(1) If NW^ < 1, sample U from W[0, A^VT^] and go to step 2. Otherwise, with probability 
r^ ([A^W^^J + 1) /NW'^ sample U from W[0,r^], otherwise sample U from U[r^, 1]. 

(2) Perform standard systematic resampling (without cycle randomisation and using the 
value of U sampled in Step 1). Call A^'^ the obtained vector of labels. 

(3) Choose C uniformly from the set of cycles such that A"^^^^ = 1 and set A^'^ = A'^^^''^\ 

In the second part of step (1), U is being sampled from its law conditional on A} = 1. In 
step (3) C is sampled from its law conditional on U and A} = 1. 

7. A COUPLING OF THE PARTICLE GiBBS MARKOV KERNEL 

To establish the stability of the Feynman-Kac system, it is common to make the following 
assumption. 

Assumption (G). There exists a sequence of finite positive numbers {gt}t>o such that < 
Gt{xt) < Qt for all Xt E X , t > 0. Moreover, 

If 1 

mQ{dxQ)Go{xQ) > — , inf / m{xt^i,dxt)Gt{xt) > — , t > 0. 

go xt-iexj Qt 

The purpose of this section is to establish the following Theorem. 

Theorem 4. Under Assumptions (G) and (MR), for any t G (0, 1) and T G N+, there exists 
Aq G N+, such that, for all N > Nq, Xq,t, Xq-^t G X'^'^^ , and ip : X'^^^ — )■ [—1, 1], one has 

This means that, for A^ large enough, the kernel P^ is arbitrary close to the independent 
kernel that sample from Q^-. A direct corollary is that, again for A^ large enough, the kernel 
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P^ is uniformly ergodic (see e.g. Roberts and Rosenthal (2004) for a definition), with an 
arbitrary small ergodicity coefficient. 

The proof of Theorem |4] is based on the following coupling construction: let 7t(dz^, dz^) be 
a joint distribution for the couple (Z*, Z*), such that the marginal distribution of Z*, resp. 
Z^, is P^{xo;T,dz^), resp. P^{xo;T,dz^). Then 

P^{^){Xo:t) - P^{f){Xo:T) = E^ {^{Z't) ' <^(^t)} 

= E^{(^(Z^)-^(Z^))l^^^^^^j} 



The following section describes the coupling we are using. Section 7.2 then establishes 
that this particular coupling ensures that 

(7.1) P^(Z^ ^ Z^) < e/2 

for A^ large enough, which concludes the proof. 

7.1. Coupling construction. The coupling operates on the extended space corresponding 
to the support of the conditional distribution (5.3). The idea is to construct two conditional 
particle systems generated marginally from ( |5.3[ ), that is, two systems of A^ — 1 trajectories, 
denoted respectively (Zq':^ , Aq:^_^) and (Zq':^ , A'^.^_j) , that complement respectively the 
trajectory xq;t (first system) and xo:T (second system), in such a way that these trajectories 
coincide as much as possible. We will denote by Ct C 1 : A^ the set which contains the particle 
labels n such that Z" and Z" are coupled. Let Cf = (1 : N)\Ct; by construction, Cf always 
contains 1, since the frozen trajectory is relabelled as trajectory 1 in (5.3). Before we define 
recursively Ct, (Z^;^, A^:^ 



T~l) 



and {Zq':;^ , Aq:^_{) , we need to introduce several quantities. 



such as the following empirical measures, for t > 0, 



neCt 



the following probability measures, /io((i-2o) = ^TT-oidzo) 



nee- 
and for t > 1, 



^J't{dzt) 

l4{dzt) 
filidzt) 



'^Gt-Aict-Mdzt-i)qt{zt-i,dzt), 



Xi+l 



/ '^Gt^r{^q_J{dzt^i)qt{zt-i,dzt] 
'^Gt-Aic-_,)idzt-i)qtizt-i,dzt] 



Xt+i 



where 



the measures "^Ct- 
the constants 



^Ct-iidzt-i)Gt^i{zt^ 



}x^ict-Adzt-i)Gt-i{zt-i) 
{rnc<^_ ){dzt-i) and \I/Gt-i(^c=_ ){dzt-i) being defined similarly, and finally 



A 



^Ct-iiGt-i] 



t-i 



^c..AGt-i) + ^cfJGt-i) 



\t- 



^Ct-iiGt-i 



^Ct-iiGt-i) + ^Ct-iiGt-i 
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and the measures 

1 - \t-i A Aj_i 1 - Ai_i A At_i 

Ai_i — Ai_i 1 — Ai_i V At_i ^ 

1 - At_i A Ai_i 1 - \t-i A At_i 

We now construct Ct and the two particle systems as follows. First, set Cq = 2 : N, hence 
Cq = {1}, draw Zq independently from mo, and set Zq = Zq, for all n G 2 : A^. Recall that 
Zjq ^ xq and Zjq ^ xq* 

To progress from time t — 1 > to time t, we note that there is a Ai_i (resp. Xt-i) 
probability that v4"_^ (resp. A^_^) is drawn from Ct-i, for any n & 2 : N . Hence, the 
maximum coupling probability for (^"_i, ^"_i) is Af_i A Xt-i- Thus, with probability At_i A 
At_i, we sample A"_^ from Ct_i (with probability proportional to Gt-i{Z]^^), for m G Ct-i), 
Zt+i ~ qt+i{zf",dZt+i), and take (i^_i,Zf) = (A^"_i,Zf). Marginally, Z^ = Zj" is drawn 
from fit, and we set n E Ct- 

Conditional on not being coupled (hence we set n G Q), (^"_i, ^") and (v4"_i, ^i"_i) may 
be sampled independently using the same ideas. Assume Xt-i < \t-i- With probability 
(At_i — Ai_i), one should sample A"_-^ from Ct_^ and A^_^ from C(_i. And with probability 

1 — Xt-i V A(_i, both A^ and A" may be sampled from Cf. Either way, Z" ~ qt{Zt_!:^^, dZt), 

^r ~ QtiZt_!:{^ ,dZt), independently. By symmetry, the case At_i > Xt-i works along the 
same lines. Marginally (when integrating out A'^_i and A^_i), and conditional on not being 
coupled, the pair (Z", Z") is drawn from Kj. Clearly, this construction maintains the correct 
marginal distribution for the two particle systems. 

At the final time T, the trajectories Z^, Z^ that are eventually selected, that is, the output 
of Markov kernels P^ [xq-t, dz^) and P^{xq-t, dz^) may be coupled exactly in the same way: 
with probability At A At, they are taken to be equal, and Zt = Z^ is sampled from /it; and 
with probability (1 — At A At), {Z^, Z^) is sampled from kt- 

The motivation for this coupling construction is that it is the maximal coupling for quanti- 
fying the total variation norm between CPF kernels P^ {xq-t, ■) and P^{xq;t, ■) when either 
T = or T > and rrit is a Dirac measure for all t. Details of proof of this fact can be 
obtained from the authors. 



7.2. Proof of inequality (7.1). We now prove that the coupling construction described in 



the previous section is such that inequality (7.1) holds for A^ large enough. 



By construction, one has that P(Z^ = Z^) > E(At A At). Consider the event 

I \Ct\ - 2 
Given Assumption (G), and the definition of At, one has 



^CriGr) 
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and the same inequality holds for 1 — At, which leads to: 

(1 - At A At) x I^ < 2— if^^ x I^ < 2^^ x I^ 

where the second inequality is due to Assumption (G). Therefore, for fci, . . . , /cy G 1 : A^, 

E { (At A At) x I^| |Ci| = A^ - A;i, . . . , \Ct\ = N - kr] 



- V'-'^W^St] ^{U\Ci\ = N - h, . . . ,\Ct\ = N - kr} . 

Conditional on Z^-^^, and n G Ct, -^t i^ ^^ independent draw from iiT^dzx)- Thus, by 
Hoeffding's inequality (Hoeffding, 1963), 

E { (1 - 1^)1 |Ci| = AT - A;i, . . . , |Ct| = AT - ^t, Z]^^^ = 4^5i} 



< exp -2(A^ - kr) 



f-iriG-^' 



IT 



< exp 



{N-krY 



2g^ 
where the last line again follows from Assumption (G). Thus 

E { (At A At) x I^| |Ci| = A^ - A;i, . . . , \Ct\ = N - kr] 



>u-2J^AU-..,( c^-*-' 



'N-kT^'j r ""V V 



T 



\T 



Finally, for any sequence of integers Li-^t G (1 : A^)^ 

E(At a At) > 5^ . . . 5^ E { (At a At) X I^I |Ci| = at - fci, . . . , \Ct\ = N - kr} 

ki=l kT=l 

xP(|Ci| = A^-A;i,...,|Ct| = A^-A;t) 



Lt 2\ L f (N-L. 



:T) 



X ^... ^ xP(|Ci| =A^-A;i,...,|Ct| = A^-A^t) 

provided A^ is large enough. To conclude, we resort to a technical lemma, proven in the 
following section, that states it is possible to choose Li, . . . , Lt large enough so as to make 
the sum of probabilities in the last line above as large as needed. In addition, for Lt fixed 
and A^ large enough, the two factors in front of that sum are arbitrarily close to one. 

7.3. Technical lemma. 

Lemma 5. Under Assumptions (G) and (MR), and for any 5 G (0, 1), T G N+, there exist 
positive integers Nq, Li, . . . ,Lt such that for any N > Nq and Xq-t, Xq-t G X'^'^^, 

Y,---Y. ^ (l^il =N-ku...,\CT\ = N-kT)>il- 6)'^. 

ki = l kT=l 
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Proof. Let 

and recall that Ut is the probability (conditional on Z^'^) that n G Ct+i, i.e. that particles 
Z^^Yi ^^d Z^"^^ are coupled. Thus, and using the fact that, under assumption (R), the particle 
system is permutation invariant, one has: 

P(|Ci| =N-ki,...,\CT\ = N -kr) 

n Ur - ^ n '^*(^^"' ^^") n /^*(^^") (^ - ^*; 

(7.2) > / n<^n'^*(^^"'^^") n /^*(^^": 

(7.3) X U 



t=0 l,?i=2 n=kt+l 

(iV-l)!(l-a;t)^''+'"'M / c^f"''*^'^ 



(fct+i-l) iN-kt+{) 

UJt 



{N-kt+i)\ J [(A;i+i-l)! 

with the convention that fco = 1? that fiQ{dzo) = mo{dzQ), and that empty products equal 
one, and defining the event At as 

(T.4) ^,.|!Eji^>^ 

Note that the integrals above must be understood as conditional probabilities, since fit and 
Kt are conditional measures. In addition, the conditional distributions (first bracketed factor 
in the integral above) defining these conditional probabilities are such that \Ct\ = N — kt 
with probability one. 

For the sake of transparency, we complete the proof for T = 2, but we note that exactly 
the same steps employed may be extended to general case where T > 2. 



The key idea is to replace the two factors in the integrand of (7.2) with their large N 
values which we now define. Let 

(7-5) At = \Ct\ X -^ — ' , for \Ct\ > 

and set Aj = if \Ct\ = 0. Using Lemma [61 stated and proved at the end of this section, one 



has, for fixed ki, ^2, S, and A^ large enough, that the integral in (7.2) is larger than 

__fio{dzi;)}- 

'■^0 I „=2 



(7.) (i-^)'£,in.oW)[^^ 






X 

^1 Ln=2 n=fci+l 

We now explain how to choose Li, L2 such that, for A^ large enough 
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N ^ ^1 e-Ao^k.-l 



X / n '^i(^^"' ^^^) n ^i(^^i") E r^ ni ^ (^ - ^)'- 

J^^ U=2 n=A:i+l J ^^ = 1 ^^^ " ^J! 

First note that, given Assumption (G), and since |Co| = A^ — 1, \Ci\ = N — ki (with prob- 
abihty one under the conditional distribution that appears in (7.2), for t = 1, as explained 
above), and since icAC^t) > \Ct\ jJ^tiGt)/'^ (by event At), one has (again with probability one 
under the same conditional distribution): 

(7.8) < Ao X I^„ < 2(7o' X U„ < Ai x I^, < 2(?2(A;i - 1) x I^, 

for N > ki (otherwise the probability that \Ci\ = N ~ ki would be zero). Choose Li, then 
L2, such that 

Since a; — )■ e""^ Sfc=o ^^^/A;! is a decreasing function for x > 0, this choice of L2 ensures that 

■^-41 U=2 n=fci+l J fc2=l ^^2 - IJ! 



fci Af 



By Hoeff ding's inequality. 



'-41 l^„=2 n=fci+l 



1^1 n /iiW)[<exp(^-2(Ar-fci; 



2~ ) < exp ( -(A^ - /Ci) 



4^f ; " V 2 

where the last inequality follows from Assumption (G). Using the same calculations for the 



first integral in (7.7) (that is applying Hoeffding's inequality to ^0 and so on), one obtains 
eventually: 



"^ ^ \e-^'^A',^-' 



-4o ln=2 J fci=l ^'*^1 ^ 



g-Ai^fc2-l 



; (^2 - 1) 



^ r fci AT ^ L2 

X / IUkm^iz^)) n /^iW) E 

■'■^i tn=2 n=fci+l J fc2=l 

> (1 - <5)2 (^1 - exp (^-(AT - LO^)) (1 - exp (^-(AT - 1)^ 
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and therefore, combining this with (7.6), one may conclude that 

Li L2 
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provided A^ is taken to be large enough. 

D 

To conclude the proof, we state and prove the following lemma, which we used in the proof 



above in order to replace the two last factors in (7.2) by their large- A^ values. 

Lemma 6. Assume (G). For any given 5 > and positive integers ki, . . . ,kT there exists 
a positive integer Nq such that the following inequalities hold for all N > Nq and xqt, 

a;i^-'*+^)exp(A,)xU>(l-<5)xU, 
,[^~^^\, (1 - utf'^'-^^ -^^ X I^, > (1 - 5) X U. 

Proof. Given that \Ct\ = N — kt and the respective definitions of Ut and A^, one has: 

l-oot ^ At 
ujt ~ N-kt 

and therefore for A^ large enough and kt fixed, and conditional on I_4j = 1, the probability 
1 — uJt may be made arbitrary small, given that A^I^^ is a bounded quantity; see ( 7.8[ ). Since 



log(l + x) > X — x"^ for X > —1/2, one has, for A^ large enough (so that x = Ut — 1 > —1/2) 
and conditional on I_4^ = 1, 

A, + (AT - ^,,0 log c, > A, { 1 - ^^^-, - ^^A,.,^ 

which can be clearly made arbitrary small (in absolute value) by taking A^ large enough, 
since both Ut and A^ are bounded quantities. The second inequality may be proved along 
the same lines. D 

8. Forced move 
Step CPF-2 may be replaced by one that tries to force a selection of a trajectory different 



from the reference, in the spirit of the metropolized Gibbs sampler of Liu ( 1995 ). Specifically, 
CPF-2 is replaced by a Metropolis-Hastings step on the variable A^* instead of the Gibbs 
step. The proposal is concentrated on index n = 2 : N with probability W^/{1 — W^), using 
the short-hand W^ = Wf{z^^). Thus, set A^* = n with probability 



wp . fl-W^ 



mm 



wy vi-w^T 



for 77. 7^ 1; otherwise set A^* = 1. We call this the variation of PG forced move. As in Liu 



(1995), one is able to establish that the forced move strategy always brings an improvement. 



Proposition 7. Assuming < Gt{xt) < Qt for all xt E X and t, then forced move CPF 
dominates CPF in Peskun ordering. 
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Proof. Recall that for Markov kernels Pi and P2, Pi dominates P2 in Peskun ordering iff 
Pi{x,A) > P2{x,A) for all X and measurable sets A with x ^ A. The assumption of the 
potentials are merely to ensure the PG kernel is well defined. Let Zl':^ be the sequence of 
random variables generated in step CPF-1, A^* the outcome of CPF-2, and let N* be the 
chosen trajectory of forced move. Then for zt ^ A 



N 



E 



UZ^' 



z]i 



1:N 
T 



W^ 



W^ 



mm 



n=2 

>E{UZ^*)\Zl:^). 



W^ 



w:^' 



1 uz^. 



D 



A way to quantify the improvement brought by a forced move is to note that Peskun 
ordering implies lag-one ordering and, assuming reversibility, efficiency ordering, see the 
discussion around Theorem [9] in Section 9.4 for a definition of these ordering of Markov 
kernels. 



9. Backward sampling 



It is convenient in this section to revert to standard notations based on the initial process 
Xt, rather than on notations based on trajectories Zt = Xo-^f Thus, we now consider the 
following (extended) invariant distribution for the CPF kernel 

1 



vr. 



rp idXri.'Y' •) (XClr\.qn 1 , (17X I 



,®N 



{dx]^n 



(9.1) 




U^t-l; 



Qt-i{xt-i,daf.'_i^ 



N 

n 

n=l 



mt{x^'_]',dx'^\ 



X ^Gt(x^ ) 



which, compared to (5.1), represents a simple change of variables. Note that denoting the 

is an abuse of notations based on the fact that 



l:Af 
t 



da]--'') 



resampling distribution by Qt{x 

a resampling distribution depends only the weights Gt 

notations, the n-th trajectory ZJj^ becomes a deterministic function of the particle system 



Gt(x?). In this new set of 



{Xq':^ , AQ■.^_-^^), that may be defined as follows: Z^ 

TDTi 

are defined recursively as: BJ^ = n ""'" 

z^- 



B? 



A?^' 



before, Zn 



{xT\ 



function of {X^:^ , A};:^_, , A^ 






, with _By 



= (^0 ,•• 
, for t G 

A^*, B 






where the indexesi?^'s 



(T — 1). Similarly, as noted 






i.e. Z^ is a deterministic 



I^Ml' 0:T-l5 



Whiteley ( |2010 ), in his discussion of Andrieu et al. (2010), see also Lindsten and Schon 



(2012), suggested to add the following BS (backward sampling) step to a Particle Gibbs 
update: 

CPF-3: Let -By = A^*, then, recursively for t = T — ltot = 0, sample index B* = 



.B 



At '+' G 1 : A^, conditionally on i?. 



t+i 



b, from the distribution 



(9.2) 



^N (Ab _ b\ ^1:N _ 1:7V a- 



0:Tt 



Al:N 
^0:t-l 



A:N 



\1.N 



'^0:t-l' ^t+l:T 



l:N yy* 



n*) 



oc 



Qt ( 



X 



V.N 



A 



<\AT 



^) mt+i{x 



t i^t+i/ 
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and set X^ = x'f . (Recall that Z^ = {Xq\ ..., X^^) 



In (9.2) , Aj^ is A\-^ minus A\, aj^ is defined similarly, gt {xj'-^^A'l = a'l\Aj'' = a^'') stands 
for the conditional distribution of A^, given that A^ = a^ , relative to the joint distribution 
Qt{x\'^ ,da\'^), and finally mt+i{xl\ x\j^]) is the probability density (relative to probability 

of' 

measure dx) of conditional distribution rat+i{x/ ,dxt+i) evaluated at point x^^.^. 

This extra step amounts to update the ancestral lineage of the selected trajectory up to 
time t, recursively in time, from t = T — 1 to time t = 0. It is straightforward to show that 

I R* 

(9.2) is the conditional distribution of random variable B* = A^ '^\ conditional on -B*+i = b 



and the other auxiliary variables of the particle system, relative to the joint distribution 



(9.1). As such, this extra step leaves Qxidxa^T) invariant. 

It should be noted that the BS step may be implemented only when the density mt+i{xt, Xt+i] 
admits an explicit expression, which is unfortunately not the case for several models of prac- 
tical interest. 



When this density is tractable, and since the support of (4.2) is 1 : A^, sampling from 
this distribution is a 0{N) operation, provided one is able to compute simultaneously 
Qt [xj'^ , A\ = a'l\A'j~ = a^ ) for all possible values a^ G 1 : A^ in 0{N) time. When multino- 
mial resampling is used, this quantity simplifies to gt [x]'^ , A\ = a'l\A^'' = a~[^^ = W^{xl'^^), 
since the A"'s are independent. We explain now how to compute this quantity when either 
residual resampling or systematic resampling is used. 

9.1. Backward sampling for residual resampling. To describe how to compute the 
probability gt {x}''^ , A\ = a!l\A~[^ = a^'') when residual resampling is used, we use the same 
short-hand notations as Section p| for any n G 1 : A^, W"^ stands for W^{zl''^)^ r" = 
NW^ - [NW\, R = E!Li^"' and in addition let x"(4^^) = Em=i !{«}«). X"(«r^) = 
X]m5^fe^w(^r)5 that is X^i'^t'^)^ resp. x^'i'^t^)^ counts the number of off-springs of particle 
x^ within aj'^, resp. a^^. 

Then, using standard properties of the multinomial distribution, it is easy to see that the 



algorithm described in Section 6.1 (that is residual resampling, plus random permutation) 



actually samples from the following joint distribution with density 



R\ 



(9.3) ^t(^r,«r) = 9yn 



("l^ {x"iar-^)-\NW-\} {x''{al'-^)y- 



n=l 



N\^J-\Rj {x^{al--^)-[NW-\)\- 



Thus, provided a^^ is such that x"(ar^) > [NW] for all n G 1 : A^, 
(9.4) gtixl-"", 4 = mlA;' = a'") oc r"^ x ^"^'"''^ + ^ 



X™(a-'')- [NW""] +1' 

for any m G 1 : A^. In case there exists n G 1 : A^ such that x^i'^T ) = [A^VT"] — 1, then one 
simply has 

gtixl--"", A', = m\A-' = a'') = I„(m). 

This refiects the fact that residual resampling always generates at least [A^VT"] off-springs 
of particle n for each n G 1 : A^. 

During the forward pass of the algorithm, one may compute quickly (that is, without 
changing the 0{N) complexity of the algorithm) quantities such as r", [A^H^^J and x^i^'t''^) 

17 



for all n G 1 : A^. Then, since x"'{0't '') = X"' {O't'^) ^'^{n}io.t) ^ computing (9.4) for all m € 1 : A^ 
is also a 0{N) operation. 

9.2. Backward sampling for systematic resampling. One difficulty in computing the 
probability gt [x]'^ , A\ = a^l^^^ = ^^^) that is specific to systematic resampling plus cycle 
randomisation, as described in Section 6.2, is that one needs to consider several cases, de- 
pending whether one or more than one cycles may have generated A^^ = a^^ . To outline the 
main ideas, we consider here only the case where the cycle is uniquely determined by inspec- 
tion of a~[ , and defer other cases to Appendix A. This amounts to assuming that a^~^ < a^"*"^. 
To avoid treating separately 6=1 and b = N, we abuse notations as follows: a^"^ refers to 
a^ when b = 1, and a^"*"^ to a] for b = N. This can be seen by recalling that Aj'^ = A^ 
(see Section 6.2) where A^'^ is the non-decreasing sequence taking values in 1 : A^. Thus if 
ttf.^^ < a^^^, c is determined by c{m) = 1, where aj" is the unique component of a^ such that 

ar' > aT. In that case, ft (a;^, ^t = a'Mt' = «r') = Qt (^^^ X = «f 14^'' = ^r"'), 

where the latter quantity is the probability that A^' = a, conditional on A^^ = d^^ , with 
b' = c{b), which, in turn, may be deduced from the joint distribution of the labels before 

cycle randomisation ft [x]''^, A^' = a]'^) as follows: 

ft (xr, a'; = aflA ' = a;') oc ft (x^, X"" = a^) , 
i.e. compute the right hand side for all Oj E 1 : N, and a fixed a^ , then renormalise it to 



obtain the left hand side for all a^ E 1 : N. 



Let ^(0) = 0, S{n) = A^Em=iW^"' for n G 1 : A^. By construction, see Section 6.2 



the probability ft ixl'^, A^' = a\'^ j equals the probability that a uniform random variate 
u ~ W[0, 1] simultaneously fulfil the following A^ inequalities, for n El : N , 

(9.5) S{a''^ -l)-{n-l)<u< S{a''^) - {n - 1). 

In other words, ft ( x\'^ ^ A^' = a^'^ j is the length of the interval defined as the intersection 

of [0, 1] and the A^ intervals defined above. 

To efficiently compute ft ixj'^, A^ = a\ |Aj = a~[ j for all Oj E 1 : N ^ one may pre- 

compute the intersection J{a^ ) of [0, 1] and the A^ — 1 intervals above for n ^ b' . Then 

computing ft (x]'^^ A^' = a]'^ j for all a^ G 1 : A^ reduces to computing the intersection of 

J{ciJ^ ) and the interval (9.5) for n = b' and the considered value of a\. Thus, this backward 
step may be implemented in 0{N) time. Note that the S'(n)'s have already been computed 

during the forward pass of the algorithm. Also, in practice ft [x]'^, A^ = af 1^^ = a^^ ) 
needs to be computed only for a^ in the range a^~^ < cti < '^t'^'^, ^^ this probability is zero 



b-l \ „b+l 



by construction outside this range. This remarks allows for an extra speed-up. 

Again, we refer to Appendix A for how to implement backward sampling when a1~^ > af 
(and therefore one need to take into account cycle uncertainty). 

9.3. Peskun ordering and backward sampling. Peskun ordering was useful in estab- 
lishing that forced move PG is preferable over PG and one might be tempted to think 
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that Peskun dominance of PG with backward saniphng over PG could also be shown. The 
following counter-example shows otherwise. 

Example 8. Let A:" = M, {Xt)t>Q be an i.i.d. sequence with marginal law in{dx), and let the 
potentials be unit valued, i.e. Gt{xt) = 1 for all t. Then one can show that the CPF kernel 
with backward sampling does not dominate the CPF kernel in Peskun sense. For example, 
let i3 = i3o X . . . X Bt C X'^^^, where m{Bt) = e for all t. If we choose a reference trajectory 
xo:T ^ -B but Xt G Bt for t ^ T, then it is easy to show that (e.g. when T = 2 and N = 2) 
that the probability of hitting B when starting from xo:t, i-G-Pri^o-.T, B), is higher without 
backward sampling than with it. In this example, a chosen trajectory that coalesces with 
the reference trajectory has more chance of hitting set B. 

9.4. Reversibility, covariance ordering and asymptotic efficiency. Backward sampling 
(BS) does in practice bring improvement to the decay of the autocorrelation function of suc- 
cessive samples of Xq-t generated by the Particle Gibbs sampler, i.e. more rapid decay 



compared to not implementing BS; see our numerical experiments in Section 10 However, 
how much improvement depends on the transition kernel mt{xt-i,dxt) of the hidden state 
process (Xi)^>Q. If only Xq ~ mo is random while mt{xt-i,dxt) is a point mass at Xt-i for 
t > 1, then it is clear that BS will bring no improvement. We can however prove that, 
regardless of rrit, the empirical average of the successive samples from a CPF kernel with BS 
will have an asymptotic variance no larger than the asymptotic variance of the empirical av- 
erage of successive samples from the corresponding CPF kernel without BS. The asymptotic 
variance here is the variance of the limiting Gaussian distribution characterised by the usual 



n-Central limit theorem (CLT) for Markov chains; see for example Roberts and Rosenthal 



(2004) 



The following result due to Tierney (1998), see also Mira and Geyer (1999), formalises 



this comparison, or ordering, of two Markov transition kernels having the same stationary 
measure via the asymptotic variance given by the CLT. We call this efficiency ordering. 



Theorem 9. (Tierney, 199^ For ^q^^i^... successive samples from a reversible Markov 
transition kernel H (on some general state space) with stationary measure ti , where io ^ t^, 
and for f G i^^(vr) 



{/ : J Tc{dx)f{x)'^ < oo} let 
v{f, H) = lirrin^nc— Variance 



'n—l 



n 



E/te 



Let Pi and P2 be two reversible Markov kernels with stationary measure vr such that 
E.^Pi {^(^0)^(6)} < E.«P2 {9i^o)9{^i)} for all g G L\t,). Then v{f,Pi) < v{f,P2) for all 
f G L'^{7f) and Pi is said to dominate P2 in efficiency ordering. 



Note that in the original version of this theorem by Tierney (1998) the requirement on Pi 
and P2 for v{f, Pi) < v{f, P2) for all / G L'^{'^) is that Pi dominates P2 in Peskun ordering. 



However, Tierney's proof actually makes use of the weaker Peskun implied property of lag-1 



domination instead, as also noted in Theorem 4.2 of Mira and Geyer (1999). 



To prove efficiency odering, we must prove first that the CPF kernel, with or without BS, 
is reversible. Following reversibility we then need to show that the CPF kernel with BS has 
smaller lag 1 autocorellation compared to the CPF kernel without BS, this property if holds 
is called lag-one domination. 

19 



Proposition 10. Assume (R). The CPF kernel is reversible. 

Proof. This result is based on tlie equivalent representation of the CPF kernel derived in 
Proposition 3 Consider a Qxidz^) ® P^ {z^ (ii^)— measurable function h and let (Z^ Z^) = 
{X*.T, X*,t) --Qt(S)P^ then 

E{h{z^z^)} = /"qt(c^<t) /7l"T(^4?^^^4;?-l'^^S:T-l.^^1<T) 

X / 7Crp I (171/ Xq.'P 5 Ctri.n^ 1 ) niZrp ^ Zrp j 

N f 1 1:N 1 1:N ^ ^\ I ^f^'^^l ^'-^ 1-^ \Uf "-* "■*^ 
TTn^ ( CJXq.'T' J Cldryrp 1 ^ (171/ J I TTrp {(171/ U^Q-T^ 5 ^O'T 1 j'^y^'T' •) -^7^ / 

TTnn \,(XJbr\.rp ^ (lClr\.rp -i ^ (111 I / Tirri {(lib Xri.Tn ^ (lir\.rp t j fl,{,Zrp ^ Z/rp \ 



where the first equality follows from Proposition [3] and the remaining steps are a change 
of variables; again, z^ , resp. z^ , must be understood as a certain deterministic function 
of (a^o;T5 ^o-T-i' '^*)' resp. (a^o-r; ^o-r-i''^*)' i^ ^^^ equations above; see notation at the 
beginning of this Section. D 



We now prove a similar result for CPF-BS, the CPF kernel with backward sampling. To 
that effect, we must first generalise Proposition Is] to the CPF-BS kernel. (Proof of this 
corollary uses the same construction in the proof of Proposition [s] and is omitted.) 



Corollary 11. Assume (R). Let Pj,' denote the CPF-BS Markov kernel. Then the image 
of z^ G X"^^^ under P^ ' is unchanged by the choice of {bQ.j<_i,n*) for the realization of 
{BQ.rp_-^^, N*) in the initialization, i.e. Step CPF-1. 

Proposition 12. Assume (R). The CPF-BS kernel (CPF with backward sampling) is re- 
versible. 

Proof. A simple proof in the multinomial resampling case, i.e. Assumption (MR), is obtained 
by remarking that the image of CPF-BS would be unchanged if Step CPF-3 would be replaced 
by a Gibbs step that would update the complete genealogy, i.e. replace AqIt-i by A}^:^_^, a 
sample from TTJ^{da^:^_i\x}^:^,n*). This is because under multinomial resampling, the A"'s 
are independent conditionally on (Xq/-^, A^*). We defer the technical proof under the more 
general Assumption (R) to Appendix B. 
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Consider a Qxidz^) ® P^ {z^ (ii|>)-measurable function h and let {Z^ Z^) = (^q.^-, Xq. 



0:T;^»-0:T7 



}t ® Pt then 



/TV/ 7-^1 1:A^ 1:A^ \ / Nfi-^liN \ 1:N -*\^,/ "-* ri*\ 

TTrp {(ICCrt.'Y' , ddri.rp 1 ) / TXrp [dTl Xn.T^ j'^'V [ddrt.n-' 1 ■^O'T^ 5 ^ / 

X [ TT^ {dh*\xl:^) [ 71^ {dd^Q\^_^\x^o:^ ,n*)hizJf , z^*) 



E{h{Z^,Z^)} 



where the second equahty is a consequence of Corollary [TTJ the third equality is based on 
the fact that one may generate {Xq':^ , Aq.^_-^^, N*) ~ vr^ as: {Xq':^ , AQ.rp_^, N*) ~ vr^, then 

update ^o:T-i ^^ ^o-t-i through the Gibbs step t^t {^^o-.t-iI^o-.t ^ ^*)^ ^^d the fourth equality 
is a simple change of variables. The simplification of n^ {dn*\xQ:^ , aQ.^_i) into 'n'^{dn*\xQ:^) 
(second equality onwards) reflects the fact that Step CPF-2 does not depend on ag-T-i- ^ 

The final result shows that the CPF-BS kernel dominates the CPF kernel in lag-one 
autocorrelation under Assumption (MR). 

Proposition 13. Assume (MR). The CPF-BS kernel, denoted Pj, ' , dominates the CPF 
kernel in lag one autocorrelation, i.e. let h be square integrable function then 

< E^^^p.^B {hiZ*^)hiZ*^)} < Eq^^p. {/i(Z^)/i(Z^)} . 

Proof. We use again the facts that ii^ {dh*\xQ.^ , aQ^_i) reduces into n^{dfi*\xQ.^), and that, 
under multinomial resampling, Step CPF-3 may be replaced by a Gibbs step that updates 
the complete genealogy as in the proof of Proposition IT2| 



X fn^idn'lx'o,^) /"<(rf4^_i|xj:^,fi^)/i(^^*)/i(4*) 
= I n^{dxl^) (JT^Udn^lxl^) /"7r^(t/aJ;^_JxJ:^,n*)/i(4 

< I T^T{dx\^T) f T^Tida^T.^xl^) ( f 7r^idn*\x'o.^,a'o:^_M4*) 

= Eq^^p^^{/i(Z*)MZ^)} 

The penultimate line uses Jensen inequality. The last line is indeed the the same expectation 
but under Qt ® Pt (no BS step). D 
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Combining this result with Propositions 10 and 12 (reversibihty of both kernels) and The- 
orem [9] from Tierney (1998), we conclude that the CPF-BS kernel dominates the CPF kernel 
in efficiency ordering (smaller asymptotic variances), at least when multinomial resampling 
is used, i.e. Assumption (MR). 

For other resampling schemes, establishing lag-one dominance for all test functions does 
not seem to be easy, so one might consider an alternative measure of efficiency for MCMC 
algorithms. The Mean Square Jumping Distance (MSJD) is one such measure (e.g. Andrieu 



and Thoms , 2008 ) and in the following result we show that incorporating backward sampling 
will not decrease the MSJD for a certain class of additive test functions. Note that for 
Multinomial sampling we were able to prove a stronger version of the stated result below 



(Proposition 13) where the test function need not be of the additive structure required 



below but just square integrable with respect to tit- The main reason why we are limited 
to the following weaker result when one only assume (R) is that for multinomial sampling 
Qt {xj'^ , A\ = ctjIAj" = a^ ) = Wj^{xl'^), i.e. A^^s are independent even when conditioning 
on(Xi;^,A 
schemes. 



^\A'^), the same though is not true for the systematic and residual sampling 



Proposition 14. Assume (R) and consider the following additive square integrable (with 
respect to ttt) function 

h{xo;T, bo;T~i, n) = ho{xo;i,bo.,i) + hi{xi;2, bia) h /iT-i(a;T-i:T, br-i^n). 

The CPF kernel with the additional step CPF-3 (backward sampling) has MSJD (mean square 
jumping distance) no less than the same kernel without step CPF-3, i.e. 



E 



rp Qy ±rp 



{hiZ^,B*.T._„N*) - hiZ^T^K.T-i^N"))' 



> E, 



fcg)Pf 



(/i(Z^,SoVi,A^* 



h[Zrp, Bq.J, 



-i,N^)y 



where {{Z^, Bq. 
expectation and tt^ 



T-l5 

N 



N* 



' 7* R* 



P^ in the second. 



_]^,iV*)) is distributed according to n^ 



N 



Pj, ' in the first 



N,B 



For the sake of generality, this results regards Prp ' and 



invariant measure 7r^{dz^,dbQ.rp_^,dn*). By Corollary 11, omitting the dependency of h on 



on 



X^+^ X X^+\ 



P^ as Markov kernels with 



N,B ■ 



bo:T~i, n in Proposition 14 will take us back to the usual setting where Qt^Pt ' is a measure 



It is noteworthy that there are some models for which the update of the parameter 6 de- 
pends on xq-t precisely through functions h of the above form; see the update for parameters 



/i, p and 1/cr in Section 10 Therefore maximising the MSJD for h in the stationary regime 



should beneffi the exploration of these parameters. The proof of Proposition 14 may be 
found in Appendix C. 



10. Numerical experiments 

The focus of our numerical experiments is on comparing the different resampling schemes 
described in this paper (multinomial, residual, systematic), with or without backward sampling 
The forced move step described in Section |8] is implemented. 
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We consider the following state-space model 
Xo~ A^(/i,a2), Xt+i\Xt = xtr^N{fi + p{xt-i2),a^), Fi|Xi = x* ~ Poisson(e^') 
for t G : T, hence one may take Gt{xt) = exp {— e^* + ytXt}, where yt is the observed value of 



Yf. This model is motivated by Yu and Meng (2011 ) who consider a similar model for photon 



counts in X-ray astrophysics. The parameters /z, p, a are assumed to be unknown, and are 
assigned the following (independent) prior distributions: p ~ Uniform[— 1, 1], /i ~ N{m^, s^ 
and l/o"^ ~ Gamma(ao-, &0-); let 9 = {p,p,a'^). (We took m^ = 0, s^ = 10, a„ = b^ = I 
in our simulations.) We run a Gibbs sampler that targets the posterior distribution of 
{9,Xq:t), conditional on Yq,t = Vo-.t, by iterating (a) the Gibbs step that samples from 
9\Xq,t,Yo,t, described below; and (c) the Particle Gibbs step discussed in this paper, which 
samples from Xq,t\9,Yq,t. Direct calculations show that step (a) may be decomposed into 
the following successive three operations, which sample from the full conditional distribution 
of each component of 9, conditional on the other components of 9 and Xq-t'- 

( T + 1 1 1 x~^ 

l/cr^|Xo:T = a^oiT, ^o:T, /", P ~ Gamma a„ H — , h^ + -x\ + - y^ 



2 '^ 2 



(Xt+i - pXi)^ 



Y^r-i ~ ~ 2 

I \^ v iKj \ 2^t=o ^tXt+i cr 



EU^l 'E 



i=0 



Xt 



\Y V AT I ' J -'^ I ^0 + (l-p)Ef=0^(a;t+l 
/i|Xo:T = Xo;T, 1^:T, p,a ^ N \ ^ \ —f ^ -^ 

where we have used the short-hand notations Xt = Xt — p, 





Xu = ^ + 



1 , l + T(l-p) 



2 



" sl 



and where A^[_i^i](m, s^) denotes the Gaussian distribution truncated to the interval [—1, 1]. 
In each case, we run our Gibbs sampler for 10^ iterations, and discard the first 10^ iterations 
as a burn-in period. Apart from the resampling scheme, and whether or not backward 
sampling is used, the only tuning parameter for the algorithm is the number of particles N 
in the Particle Gibbs step. 

10.1. First dataset. The first dataset we consider is simulated from the model, with T+1 = 
400, p = 0, p = 0.9, a = 0.5. 



Fig. 10.1 reports the ACF (Autocorrelation function) of certain components of {9,Xo-t) 
for the six considered variants of our algorithm, for N = 200. 

One observes the following in this particular case. First, without backward sampling, there 
is a clear advantage of using systematic resampling over the two alternative resampling 
schemes. Second, the improvement brought over by backward sampling depends on the 
resampling scheme: it is small when systematic resampling is used, and quite important 
when either multinomial or residual resampling is used. Third, the fastest mixing seems to 
be obtained by multinomial resampling with backward sampling, and residual resampling 
with backward sampling. 

Note that this ACF comparison does not take into account the relative CPU cost of each 
variant of Particle Gibbs. As a rule of thumb, one may consider that the different resampling 
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Figure 10.1. First dataset: ACF functions for different components of 
{6,Xq.t) and the six considered variants of Particle Gibbs (A^ = 200.) In 
all cases performance of multinomial with BS and residual with BS is hardly 
distinguishable. Also Systematic with and without backwards performs simil- 
arly, except maybe for a. 



schemes have essentially the same CPU cost. In fact, it is often the case that the resampling 
step represents a small part of the total CPU cost of a particle algorithm. On the other 
hand, backward sampling amounts to a backward pass through the data, which represents a 
significant extra cost, relative to the forward pass already performed during the generation of 
the particle system. However, this extra cost is of course highly model and implementation- 
dependent. In that respect, if CPU cost was taken into account, then systematic resampling 
without backward sampling would appear be the best choice in this particular exercise. 
It is also worthwhile to look at the update rates of Xt with respect to t which is defined 



as the proportion of iterations where Xt changes value; see left panel of Figure 10.2 This 
figure reveals that backward sampling (when used in conjunction with either multinomial 
or residual resampling) increases very significantly the probability of updating Xt to a new 
value, especially at small t values, to a point where this proportion is close to one. This 
also suggests that good performance for these two variants might be obtained with a smaller 
value of A^. 

To test this idea, we ran the six variants of our Gibbs sampler, but with A^ = 20. The 



right side of Figure [10. 2| reveals the three non-backward sampling algorithms provide useless 
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Figure 10.2. First dataset and resulting update rates of Xt versus t G : 
399. Left plot is for A^ = 200 and right is for A^ = 20, same legend as Fig. 
10. 1[ Note that for A^ = 200 and A^ = 20 it is hard to distinguish between 



multinomial resampling with backward sampling and residual resampling with 
backward sampling. For A^ = 20, forward only versions of systematic and 
residual perform similarly. 



results because components of Xo:3oo hardly ever change values. For the same reasons, the 
ACF's of these variants do not decay at reasonable rate (which is not shown here.) 



Figure 10.3 gives the ACF of certain components for the three variants with backward 
sampling. Although A^ is quite small, the MCMC chain seems to be mixing well, especially 
when either multinomial or residual resampling is used. 

10.2. Second dataset. We consider a second dataset, simulated from the model with T + 
1 = 200, /i = log(5000), p = 0.5, cr = 0.1. (These values are close to the posterior expectation 



for the real dataset of Yu and Meng (2011). 



What makes this example interesting relative to the first one is that the impact of backward 
sampling is even bigger in this case. We have to increase N to N = 1000 to obtain non- 
zero update rates for the three variants that do not use backward sampling, whereas good 
update rates may be obtained for A^ = 20 for either multinomial or residual resampling, when 



backward sampling is used; see Figure 10.4 Again, we observe that backward sampling leads 



to very good performance even when A^ is small, see also the ACF in Figure 10.3 



11. Discussion and Conclusions 

The following guidelines seem to emerge from the numerical results. First, when backward 
resampling cannot be used (e.g. when the probability density of the Markov transition is not 
tractable), one should run Particle Gibbs with systematic resampling, as this leads to better 
mixing. A possible explanation is that, when only a forward pass is performed, the lower 
variability of systematic resampling makes it less likely that the proposed trajectories in the 
particle system coalesce with the fixed trajectory during the resampling steps. Therefore the 
Particle Gibbs step is more likely to output a trajectory which is different than the previous 
one. 

Second, when backward sampling can be implemented, it should be used, as this makes it 



possible to set A^ to a smaller value while maintaining good mixing; see also [Lindsten and 
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Figure 10.3. ACF for certain components of {9, Xq-t) and the three variants 
of the algorithm that uses backward samphng (A^ = 20); same legend as Fig. 



10.1, Top row is for the first dataset, bottom row is for the second. 




Figure 10.4. Second dataset, same plots as Fig. |10.2| with A^ = 1000 (left 
panel) and A^ = 20 (right panel). Same legend as Fig. 10.1 In the left plot 
systematic with and without backward and forward only residual largely indis- 
tinguishable. In the right plot the three forward only schemes indistinguishable 
before t ^ 190. 
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Schon (2012); Lindsten et al. (2012) for similar findings. In this case, either multinomial or 



residual resampling may be used, as they seem to lead to similar performance and they both 
outperform systematic sampling. It seems more complicated to give a precise justification for 
this phenomenon. One possible explanation is that the greater variability of e.g. multinomial 
resampling during the forward pass also results in greater liberty in redefining the ancestry of 
the chosen trajectory in the backward sampling step; recall that with multinomial resampling 
one is able to update the index a" of the selected trajectory, without conditioning on a" , 
n' ^ n, as opposed to the two other resampling schemes. For example, the update of a" in 
the backward step of systematic resampling appears to be very constrained. 

When backward sampling is used, multinomial and residual resampling seem to lead to 
very similar ACF. Perhaps an additional advantage of residual resampling is that it may 
lead to a smaller variance of the estimator of the likelihood of IqiT given 6*, as this quantity 
depends only on the variables computed in the forward pass. This quantity could be used 
for instance to compute the evidence of the model (the marginal likelihood of Yq-t)-, using 
the harmonic mean estimator ( Gelfand and Dey [1994 ) . 

In all cases, we recommend inspecting the same type of plots as in Fig. 



10.2 and 10.4 



that is, update rate of Xt versus t, in order to assess the mixing of the algorithm, and in 
particular to choose a value of N that is a good trade-off between mixing properties and 
CPU cost. An interesting and important theoretical line of research would be to explain 
why this update rate seems more or less constant when backward sampling is used, while it 
deteriorates (while going backward in time) when backward sampling is not implemented. 
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Appendix 



A. Backward sampling for systematic resampling: remaining cases. We treat in 
this section the extra cases necessary to complete the description of the backward step under 



^H^r, 



A? 



systematic resampling started in Section 9.2, and in particular to compute the probability 



lA" 



) for all a^ G 1 : A^ and some fixed set a^ of A^ — 1 labels. We use 



exactly the same notations. In particular we avoid treating separately the boundary cases 
6 = 1 and 6 = A^ by abusing notations: e.g. 



,6-1 



> at should read a+ > ai when h 



and so on. Recall that the case a^ < a^ is already treated in Section 



Assume first that a^ > a^ . Then two cycles are compatible with a^ 



9.2 



where c is the cycle determined by c (1) = b, i.e. A 
is the cycle such that c(l) = b + 1, i.e. A]'^ = {A^ 
uniformly among the A^ possible cycles, it comes that 



1:N 
t 
b+l:N 



either c = c , 
= {A^-^, A];^-^), or c = c+, where c+ 
A\''^). Since the cycle c is generated 



Qt[.x 



1:N 



A 



\A: 



OC ft X 



A:N 



A 



1:N 



'(l-.N) 



+ Qtix 



„1:N 



A 



1:N 



'(l-.N) 



rl-.N 



^{1:N) 



where ft lzl'-'^,A 

Now assume that a^~'^ 
A\ = m with probability one. 



9.2 



a\^' 



may be computed as explained in Section 

= m. If at least one component of a~[ differs from m, then 
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Now assume a^ = m for all n ^ b. We consider the three different possible values for 
[NW'^l , which must be either A^ - 2, A^ - 1 or A^, separately. If [iVM^'"] = N, then A'} = m 
with probability one. If [iVVr"'] = N—2, one is in the same situation as the case a^~^ > a^"*"^; 
that is, only two cycles are compatible with a'j~ which are either c^ or c~ defined previously, 
and proceed exactly as was done there. Finally, assume LA^iy™J = A^ — 1. Then all possible 
cycles are compatible with a^ , that is 

cec 
where C is the set of the A^ cycles of 1 : A^. 



at 



B. Reversibility of the CPF-BS kernel under Assumption (R). Consider the dis- 
tribution 71^ {daQ.^_^\xl^.^ , n*) , i.e. the distribution of Ay.^_^ conditional on (Xq/-^, A^*) = 



(xo:^,n*), relative to the joint distribution defined in (5.1). This conditional distribution 
factorises as a product of independent conditional distributions for the vectors Aj'-^ , with 
probability density: 



7r^(4^^|xJ:^,n*) (X g,{xl--'',al--'')l[m,^,{xf,xt 



n=l 



for any t £ : A^ — 1, and where gt{xl'^ ,al''^) stands for the probability density of 
gt{xl'^ , daY^). The BS step updates, recursively from t = T — ltot = 0, a single component 
A" of the ve ctor A}''^ , by sampling from its distribution conditional on the rest of that vector 



A^' , see (9.2). Because of this, the BS steps leaves 'K^{daQ.^_]\xQ.^,n*) invariant. The 



index of the updated component A^l at time t is determined recursively by the value taken 

by A^*; for A^* = n*, one update A^_^, call b^_i the obtained value, then one updates Ar^Z2 
and so on. 

Thus, one may formalise the BS step as a probability kernel i^oir-ilao-T-i' ^q'-t-i^^^ ^ ^^oIt-i) 
that modifies the genealogy ^o^j^-i of ^^e particle system, while keeping invariant its condi- 
tional distribution, t:^ {da^:^_i\xQ.^ , n*). As described below, the resulting genealogy Aq^_-^ 
differs from the previous genealogy al^'-x-i in at most one component for each t. 

To establish that the CPF-BS kernel is reversible, we introduce the following gener- 
alisation of Ko;T-i{aQ^_i,x^^,n*,daQ.^_-^^), denoted Kj{aQ.^_iXQ^_i,n*,daQ.^_^) for J = 
{ti, . . .tk} C : T — 1, with ti > . . . > tk, which corresponds to the following partial BS 
update. 

CPF-3' Let B^ = N\ then, recursively for t = T - 1 to 0, do: if t ^ J, set B* = Af*+' 

(deterministically) , otherwise (if t & J), sample index B* = A^^'^^ ^ 1 : N, conditionally on 

B^^^ = b, from the distribution 

(11.1) 

vr^ {Al = a,^|i?,+i = 6, X',:^ = xj:^. A;' = a^") ex ft (x^, A' = 4\A7' = «*"") m,^,{xf\xl,) 

Finally, return Z^ = (A(f°, . . . , A^^). 

Clearly, the kernel Kj{aQ^_i,XQ.^ ,n*,daQ^_{) is such that the resulting Aq-t-i differs 
from cto-T-i by at most one component in those vectors Aj'^ such that t ^ J. 
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Recall that the standard CPF kernel (with step CPF-1 and step CPF-2 only) is denoted 
by P^ , and let P^ denote what we call from now on the CPF-BS(J) kernel, which 

operates Steps CPF-1, CPF-2 and CPF-3', where the latter is implemented for a specific 
J C : T — 1 indicated in the superscript. 



Proposition 15. Assume (R). For any J G : T — 1, the kernel CPF-BS(J) with the 
additional step CPF-3' (partial backward sampling) is reversible. In particular, the CPF-BS 
kernel with Step CPF-3 (complete backward sampling, J = : T — 1) is reversible. 



Proof. The proof will proceed by mathematical induction on the cardinal of J. Let J 
{t} C : T — 1, and denote kernel Kj as Kf. One has 



^^N^pN,B{J) {h{Z^, N*, Po:T-l5 ^T5 ^*' ^O-.T-l)} 

X hyZrp ,n ,OQ.rp_i,Zrp ,n , Oo;T-ly 



where ti^ {dn*\x}^.^) corresponds to CPF-2 (and our notations reflect the fact that this update 
of A^* does not depend on a]^'.fr-ii ^ property wish we use repeatedly), Kt^a^.^^^, Xq.^_^, h*, ddQ.^_i] 
corresponds to CPF-3', and zJf and feo:T-u resp. zJf and &o;T-1' rnust be understood as de- 
terministic functions of (a:Jl^, aQ:^_;^,n*), resp. of (a^g-T) ^oiT-i;"^*); i-^- ^t — (%°5- • • i^t)^ 
with 6y = n*, h^_i = a^li, and so on. The above quantity may be decomposed as 



/N ( J 1:N J l:N ^ *\ I ^f^^*] ^'■I^\T(^ ( ^'^ 1-^ "* rl~^-^ \ 
TTrri {dXn.'j^ , aCin.n-' i , (XTl j j TTnn [(ITl \Xn.q^ j IV^yCin.'j^ i , CC r\.^ i , Tl , ddn-'T 1 ) 

X h{Zrp ,n ,bQ.rp_^,Zrp ,1% , 00;T- 1 J ^j b*_|_ ^ =b*_|_ J 

~r / TCn-' iClXri.'Y' J ddr^.'-p i , (X7X j / TXrp [(271/ U^n-T^ j ^Q-T 1 ) '^t\^0-'V 1 j '^O-T 1 5 ^ 5 CKXci.'-p i ) 

X h[zrp ,n ,bQ.2'_i,Zrp ,n , OQ.y_i)I|b*^^_^^*^j. 
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The first term of tlie sum may be rewritten as 



/ Tlrp i^ClXQ.rp , CI(Iq.'J^_1, dn ) I Tlrp (^dn \XQ.rp ,(lQ.np_J^)l\t(^(lQ.rp_^, XQ.rp_^,n ,d(lQ.rp_^j 
/ Tlrp [dX^.rp ^d(lQ.rp_-\^)TXrp [dn \X Q.rp j l\ f[OjQ.rp _T^^ X Q.rp ^^^ Tl ,d(lQ.rp_^j 

X / Tlrp ydn |2^0:T / tv'^OiT— 1' '''0:T— 1' ^ ' '^'^0:T— 1/ 

11 rri \OjX r\.rT^ ^ LiCir\.rp 1 ) TT'yn [(111 M^H-T^ / '^ t\ U'T^ 1 ? D'T' 1 "> "> ^^H'T^ 1 / 

X / Tlrp i^dn \XQ.rp )l\t\(^Q-T—ly-^0:T—lJ^ y^^O:T—l) 

<yh{Z^, N*, BQ.rp_-^, Z^, N*, -Bo:T-i)^{B*+i=B*+j| 



--E 



r^®P^-^(^) 



wfiere, in tlie first equality, we liave used tlie fact tliat kernel Kt leaves invariant ti^ {^^o'-t-i \^o'-t-> ^* 
and thus a valid generative mechanism for 7r|f is ( X^':^ ^ ^q't-h ^* ) ~ '^t i then A^.^_^\Aq.j<_^ = 

aQ.^_i,XQ:^ = Xq^,N* = n* r^ Kt(aQ.^_i,x^^_i,n*,daQ.^_i), and, in the second equality, 
we have use the following property: since the genealogies a^^_^ and aQ^_i differ only with 
respect to one single component a", with n = &*+i, then for any pair {n*,n*) such that 
b*_^_l = b^_^_-^ (an event that is entirely determined by n*, n* and al':^^.rp_-^ = a^j,^.y_;^), the dis- 
tributions Kt{aQ.^_-^^, Xq^_i, h* , ddQ.^_^) and Kt(aQ.^_^,x}^^_^,h*,ddQ.^_{) are identical, as 
they both reduce to updating that particular component a" with n = b^_^_i = b^^i which 
is the only difference between ag-T-i ^^^ ^o-t-i; ^^^ therefore sample from a distribution 
conditional on either a^' or d^' , which are equal. The last equality is a simple change 
of variable based on the symmetry of the previous expression. 
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Now for the second part of the sum: 



/ ^T" [O'Xrt.'Y' •) Qjdrk.rp 1 5 QiTl j j TTrp {(172' Xq.t^ j Ivfydrt.rp 1 ^ Xrk.n-' , TZ J ddn.n^ i , 

X h{Zrp ,n ,bQ.rp_-^^, Zrp ^fl , Oq. J,_ ]^ ) I| J,*_^ ^ _^ J*_^ J 
/ TXrr> [dXn.'-T' 5 ddr\.rp -i jTTrp ( (xTi Xq.'T' )J\^( Aq.'T' 1 , OCci.'T^ , 71/ , ddr\.rp -[ ) 



X / IXrp [d71y \'^c\-'T ji^yZrp ^ 7X J ^O'T^ 1 ") ^T ") ^ ") O'T 1 / 



w+l^'^t+l} 






where in the second inequahty we have replaced tc^ {daQ:^_^\xlf^)Kt{ay.^_i, Xq.^ , n*, dd^^_{), 
the joint distribution of {Aq.^_^, Aq^_{) conditional on X^:^ , by vr;^(c?ao:T-il^oiT )'^ai-^ ('^'^oIt-i)! 
using the following two properties: (a) the marginal distribution of aj^^.]^ relative to that 
joint distribution is indeed n^ {ddlf.^_i\xQ.^) , since Kt leaves it invariant; and b) the value 
taken by the integrand is unchanged is we replace the conditional distribution of Aq.^_i, 
given ^J:^_]^ by Sq^i-.n (da^^^i), owning to the fact that, provided B*_^_^ ^ -B*+i) the values 
taken by (Z^,i?o:T-i) depends on components of the genealogy A\^^ that are identical in 
Aq!^ and ^o:T (o^ i^ other words, components that have not been affected by K^. (Recall 
that in these calculations, {z^* ,n*, &o:t-i)' resp. (z^*, n*, ^g.^..^), must always be understood 
as a certain deterministic function of {xQ.^,aQ.^_^,n*),Yesp. {xq.^ , dlf^_i, h*) .) 

One may conclude by using the fact the CPF kernel P^ itself is reversible, per Proposition 



10, and by combining the two terms of the sum above. 

Let k > 2, and assume reversibility for all J C : T — 1 such that \J\ = k — 1. We show 
reversibility for J = {ti, . . .tk} of cardinality k. Let / = {t^j , . . . ,t^} <Z J with U^ > . . . > t^, 

I < k . The idea is to decompose the expectation 
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(I ry-i QS'x'Vti 



into a sum of 2^ terms, corresponding to the 2^^ subsets I G J: 

/ '^T' l^"^n'7^ 5 OjQjr\.rTi 1 I / TTnn \(Xlh M^H-T' } -^ I \^C\-7^ 1 1 "^D-T^ 5 5 Oj(Xr,.rp i ) 

X / TTrp[dn \XQ.J')l\.J\(lQ.rp_l^XQ.rp^n , aCLQ.rp^^j 

TT'T' 1 CiXn-'T^ 5 CiCir,.^ 1 ) / TTrp [CLTl \Xr\.j^ j J\ J [drj.n^ i , Xn.n-' , TZ , ddn.n^ i ) 

n [dn \XQ.rpjKj\ClQ.rp_^^XQ.rp^n idClQ.rp^^j 




T' 

The first equahty follows from the definition of n^ ®Pj, . The second equality uses the 
fact that Ki leaves invariant tt^^, and therefore simulating (^Xl':^ ^ ^o-t ' -^*) from t^t i then 
updating Aj:^ into A}^.^ through Kj ensures that {Xl':^ , A}^:j^_^, N*) ~ n^ . The third equal- 
ity is based on the fact that, conditional on aQ.^^^, it is possible to exchange the order in which 
genealogies floi^-i ^^^ ^o-.t-i ^^^ generated without changing the value taken by the integ- 
rand; or in other words to replace Ki{a^'J^_^, Xq'-ti ^* i '^'^0:T-i)-^j('^0:T-15 ^o-t 5 ^* i "^^o-r-i) by 
Kj{aQ:^_^,XQ\^ ,h*,daQ^_i)Ki{aQ^_i,x^^,n*,daQ.^_i). To verify this, given (aQ.^_i,n*,n*), 
consider applying the kernels in the order Kj and then Kj which gives rise to the sequences 
Oo-T-i ^^^ ^o:T-i ^^ch that the constraints are satisfied by the genealogies &o:T-i ^^^ K-.t-i- 
Then these same genealogies feo-T^-i ^^^ ^ot-i '^^^ ^^^^ be generated by reversing the order 
of the application of the kernels Kj and Kj while retaining the same probability of realizing 
them. To see why this is so, firstly, we note that the probability of realizing &o-t-i i^ ^^^ 
same under Kj{aQ.^_-^^,x}^^,h*,ddQ.^_^) or Kj(aQ.^_^,XQ.^,n*,dd}^^_^) because Kj does not 
alter the variables aQ-x-i ^^ ^^^ times t G J\/; and at times t & I when Kj does make 

changes the constraints bf_^_i = b*^-^^ for t G / enforces d^ ^'^^ = a^ *^^ = a^ ^'^^ . Finally, 
whether Kj operates on Oo-t-i or aji^^.i does not alter the probability of realizing genealogy 
K-T-i- This is in part because at the times t G J\I where Kj alters aQ.^_i the constraints 
K+i 7^ K+i for t G J\I are in force, which ensures the same sequence of ancestors bQ.j^_ioi 
n* can be generated if Kj acted on dQ.^_i by only resampling n*'s genealogy at times t E I. 
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— b* —b* —b* 

Furthermore, the constraints b'l^i = b*_^_i for t G I enforces a^ *^^ = a^ ^^^ = a^ '^^ which 
means the probabihty of reahzing &o'r-i i^ unchanged whether Kj acts on a^'-x-i or aQ.^_i. 
The fourth identity uses that vr-^ is invariant through Kj, and marginahses out aj:^_p 

One concludes by using the fact that Pj,' is reversible for any I ^ J, and combining the 
2^ terms corresponding to the above decomposition. D 

C. Lag-one dominance under Assumption (R). We establish first the following lemma. 

Lemma 16. Assume (R). Then for < k < T — 1 and any function g and h of their 
respective arguments, one has 



= ^TT^^P^ |fi'(-^0:T) -So:T-l5 X*)h{Xk:T-> ^t.T-l^ ^*) \B*y^B*:T~l>i>k]^[N*y^N*]j 

with the convention thatl<g*,j^*.j._^y^^rp_^-. = 1, where ((Xq.j., A^*, Bq.j^_j), (Xq.j., N*, Bq.j^_i)^ 

is distributed according to n^ §>> -P^ ' in the first expectation and n^ (8> P^ in the second. 
Also, 

= ^n^(g,P^ |fi'(-^0:T) Bq.T_i, N*)h{XQ.j., BQ.r^_^, N*) I[5*_^^*.2._i>j>]qI[^*_^^* 

and 

^7r|f®P^'-S {diXQ.rp, Bq.j,_^, N*) I[Ar*=Ar*] } = E^N^pN [g{Xlj,, BQ.jr_^, iV*) I[Ar*=jv*] } . 

Proof. Expanding the first integral gives 



^ '^(^fc:T5 "fciT-l)'"' ) fi'(%:T5 "0:T-1)'^ >\b*^b*:T~l>i>k]Hn*^n*] 

X / 7rrp[dn \xQ.j^)h{xi^.rp,bi^.rp_i,n ) g{xQ.rp,bQ.j<_i,h ) I[b*^fe*:r-i>i>fc]I[n*7^n*] 

TTrp [dXf^'.rp ,daf^'.rp_l) / TT J, ( OU \XQ.rp) / A 0;T- 1 ( Oq iT" 1 ' -^O^T- 1 ' ^ ''^^O-T-l) 

X / 7T^{dn*\xl:^) / 6^v.N_^{dal)::^_Mxl.T^K:T-i,n*)g{xl.T,blT~i^n*)^^^^ 



Recall that in the first line we write rc^^dn^lxQ.^) instead of t^t id''^*\xo:T ^ '^o-.t-i) ^^ ^^^ 
update of N* does not depend on aQ.x-i^ ^ property we use repeatedly. The second line 
introduces the variables clq-t-i ^^^ changes the order of integration for N*. The final equality 
introduces the measure Sq^i-.n {daQ.^_i), which can be done without altering the value of the 
integrand of the outer most measure n^ {dx}^^ , da}^^_i) since h{x\.rp,b\.j._^,n*) does not 
depend on {x^.j^_^, K-.k-i) ^^^ ^^^ event \b*^b*:T-i>i>k]^n*^n*] holds. The proof of the second 
and third assertions follow similarly. D 
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We know proceed with the proof of Proposition 14 
Proof. It is enough to show that 



< E.^pM {hiZ*^, 5oVi, N^MZ^T. K.T-i. N^)] 



-.N.B 



since both PJ^'" and P^ have ttt as their invariant measure. To estabhsh the inequahty, we 
will compare both the expectations above on the mutually exclusive events I^Q*^^*]I^^*^j^*.rpy^y^ 
for k = 1, . . . ,T and Ir5*^^*.2->j>oi and show that the stated inequality holds on all these 
events separately. We adopt the convention that B^ = N*, B^ = N* and \B*^B*-T>i>T] — ^■ 
For the event I[5*_^^*.j.>j>q], by Lemma 



16 



E 



iV^piV. 



n^(S)P^- \ 



,. |/i(Z*, B*.^T-i, N*)h{Z*T, 5oVi, ^*)I 



[B*j^B*:T>i>0] 



E^N^pN <i h{Z^, -Bo;T-i> N*)h{Z^, Bq.t_^, N*)1^b*^b*:T>; 



>o] f • 



Consider now the remaining events. For the sake of terseness, let h = fk + Qk with 

fe-l T-l 

fki^Q-.k^ Bo:k) = z2 ^*(^*Wi' Bli+i), QkiXlr, Bl.j,_^, N*) = 22 hi{X*.i^^, B* 



i+l) 



i=0 



i=k 



with empty sums regarded as zero, i.e. /o = 5't = 0. Thus 

= {fkiX^.k, Bq.^) + gkiXlj,, BIj,_^, N*)) {fkiX*.k, Bq.j.) + gkiXlj,, Blj,_^, N*)) 
For the product fk{X*.k,B*.j^)fk{XQ,k^B*.k), note that 



From Lemma 



fk{Xo.k, B^.j^) + fkiX^.j,, B^.j}) > 2/fc(Xo.^, Bl.j,)fk{X^.j^, B^. 
and the reversibility of Prp ' and P^ (Propositions 



16 



10 



and 



12), 



E 



JV^nJV. 



B |/fc(-^0:fc,-Bo:fe) \Bl=Bl^[B*^B*:T>i>k] 



=E 



n^^p^''' \fkiXQ,k,BQ.j^) \bi=bi]\b^=^bi.t> 



i>k] 



J^Tr^CgipAf |/fc(-'^0:fc,-Bo:fc) I[B*=B*]I[B*^B*:T> 



i>k] 



^Tv^^P^ \ fk{XQ.k, Bq.j,) I[p-.^;^*^I[B*y^B*:T> 



H>k] 



where the first equality uses reversibility, the second Lemma 16 (using that Ir^*^^*i = 

X;„I[i?^=n]I[j3*=„], and taking h{xk:T,h:T) = I[bfc=n], 9{xo:tM:t) = fkixo:k,bo.,k)%bk=n]) and 
the final reversibility again. Thus 



^TT^CEJP^'^ {fk{Xo.,k, BQ.i^)fk{XQ.k, Bq.;^) I[j^.^Bl]\B*^B*:i>k] 
^^TT^^P^ \^fk{XQ.k, Bq.jS) I[Bl=Bl]^[B*^B*:i>k] 
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For the 3 remaining terms of the product h{Z^, BQ.rp_^, N*)h{Z^, BQ.rp_^^ N*), we again invoke 
the estabhshed reversibihty of tt^ (g) Pj, ' and vr^ P^ and Lemma 



16 



but this time to 
show that the expectation is the same under both tt^ (g) Pj. ' and tt^ (g) P^ . Details are as 
follows: 



I rp '■^±rp 






then 



i>k] 



JEyr^C^Pf I 9k{,Xl.^, Bl.T_i, A^*)/fc(Xo.^, Bq.^) I[b*=j3*]I[b*_^j3* 



and 



- "^-K^^P^ \^9k{Xl.j., -Bfc:r-l5 X*)fk{XQ.j^, Bq.j.) l^Bl=Bl'^[B*^Bl.i>k] 



>k] 



= E^^Cg)P^ [9k{Xlj,, Pfc-T-l, X*)gk{Xlj,, ^fc:T-l) ^*) I[i3*=P*]I[B*^P*:i>fc] 

which concludes the proof. D 
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